Comparison of Supervised Classification Methods for Protein Profiling in Cancer Diagnosis

Summary: A key challenge in clinical proteomics of cancer is the identification of biomarkers that could allow detection, diagnosis and prognosis of the diseases. Recent advances in mass spectrometry and proteomic instrumentations offer unique chance to rapidly identify these markers. These advances pose considerable challenges, similar to those created by microarray-based investigation, for the discovery of pattern of markers from high-dimensional data, specific to each pathologic state (e.g. normal vs cancer). We propose a three-step strategy to select important markers from high-dimensional mass spectrometry data using surface enhanced laser desorption/ionization (SELDI) technology. The first two steps are the selection of the most discriminating biomarkers with a construction of different classifiers. Finally, we compare and validate their performance and robustness using different supervised classification methods such as Support Vector Machine, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Neural Networks, Classification Trees and Boosting Trees. We show that the proposed method is suitable for analysing high-throughput proteomics data and that the combination of logistic regression and Linear Discriminant Analysis outperform other methods tested.


Introduction
Over recent years, scientifi c knowledge in cancer biology has progressed considerably. However, the practical impact of this research on screening, diagnosis, prognosis and monitoring remains limited. New methods must be developed to identify the physiological and pathological mechanisms in the origin and spread of tumors. Such approaches are essential for the discovery, identifi cation and validation of new bio-markers. Recently, progress in mass spectrometry system, such as surface enhanced laser desorption/ionization time-of-fl ight (SELDI-TOF), has opened up interesting perspectives for identifying these markers or establishing specifi c protein profi les that may be used for cancer diagnosis (Adam, 2002;Petricoin, 2002;Zhang, 2004;Solassol, 2006). In this work, we considered SELDI raw data in attempt to discriminate cancer from benign diseases. After protein ionization and desorption with a laser, the mass spectrum is represented by the intensity of the proteins fi xed on the chip (y-coordinate) as a function of the mass-to-charge (m/z) ratio (x-coordinate). From the spectra, the initial pre-processing steps are (a) the normalization and calibration to limit any bias caused by the instruments or the operator, (b) baseline subtraction, (c) peak detection, and (d) peak alignment to allow the same x-coordinate in all the spectra. One of the best challenges and the most important steps is then to reduce the high-dimension of these spectra to extract the discriminatory features or the best combination of markers capable of differentiating between two classes of interest (Duda, 2001). For this last step, spectra are processed using computerized algorithms based on multivariate statistical analyses. Several different mathematical algorithms have been applied to elucidate statistically signifi cant differences such as cluster analysis, genetic algorithms, discriminate analysis, neural networks or hierarchical classifi cation (Bauer, 1999;Petricoin, 2002;Vlahou, 2003;Wu, 2003).
In this work, we developed a three-step strategy to extract markers or combination of markers from high-dimensional SELDI data. From the pre-processing step, we detected 228 peaks, but among them some were not characteristic of the disease and were identically expressed in the two considered groups (cancer and benign disease). To allow an optimal identification of differentially expressed peaks, a preselection strategy of discriminating biomarkers combinations was chosen, rather than a simple filtering of the data by only a two-sided statistical test which was not taking into account the biomarkers inter-correlation. Next, we focused on different supervised classification methods due to the consideration that an a priori information coming from the training sample can allow the identification of the optimal diagnostic combinations. We compared the performance and the robustness of these various supervised classification methods and discussed their respective strengths and weaknesses.

Data Set and Pre-Processing
The study involved a total of 170 serum samples collected at the Arnaud de Villeneuve University Hospital (Montpellier, France) with institutional approval: 147 patients with pathologically confi rmed cancer and 23 patients suffering from a benign disease in the related organ. Whole blood was collected during fasting and all samples were processed within 1 h of collection and rapidly frozen at -80 °C before analysis. An anionexchange fractionation procedure was performed before surface-enhanced laser desorption/ionisation time-of-fl ight mass spectrometry analysis. Serum samples were thus separated into six different pH gradient elution fractions, referred as to F1, F2, F3, F4, F5 and F6. Each fraction was randomly applied to a weak cation exchange ProteinChip array surface (CM10) in a 96-well format. F2 was not subjected to analysis due to the weak number of peaks detected in preliminary experiments. Arrays were read on a Protein Biological System II ProteinChip reader (Ciphergen Biosystem). Peak detection was performed using the ProteinChip Biomarker software (version 3.2.0, Ciphergen Biosystem Inc.). Spectra were background subtracted and the peak intensities were normalized to the total ion current of m/z between 2.5 and 50 kDa. Automatic peak detection was performed in the range of 2.5 to 50 kDa with the following settings: i) signal-to-noise ratio at 4 for the fi rst pass and 2 for the second pass, ii) minimal peak threshold at 15% of all spectra, iii) cluster mass window at 0.5% of mass. The resulting CSV fi le containing absolute intensity and m/z ratio was exported into Microsoft Excel (Microsoft, Redmont) for subsequent analysis.

Biomarkers Selection
Initially, a selection of the most discriminating biomarkers was carried out. The 228 peaks detected by Ciphergen software are aligned. A peak is defi ned as discriminating when the intensities of the individuals of the cancer group are signifi cantly different than the reference group. Initially the peaks differentially expressed in the two groups were selected using the two-sided Wilcoxon's test. After this preselection, a combination of discriminating peaks is required by using a logistic regression (Pepe, 2006).

Wilcoxon's test
The assumption that each peak intensity follows a normal distribution has been rejected using a Shapiro-Wilk normality test in each group. A twosided Wilcoxon's test was employed to test the H 0 assumption of equality of the intensities in the two groups. We correct the loss of power induced by multiple tests by the false discovery rate (FDR) approach (Verhoeven, 2005;Benjamini, 1995). FDR is the expected proportion of type I errors among all signifi cant results (V/r), where V is the number of type I errors (''false discoveries"), and r is the number of signifi cant tests. A procedure to control FDR at level α was proposed by Benjamini and Hochberg (1995). That consists initially of ranking by ascending order the 228 p-values that we note now by : p (1) Յ p (2) Յ … Յ p (228) , and H (i) denote the null hypothesis corresponding to p (i) . The second stage consists of the search for k which is largest i for which: This resulting p-value p (k) is the threshold p-value for each test taken individually, such as we reject all the null assumptions H (1) , …, H (k) (Fig. 1). The null assumption has been rejected for k = 100 biomarkers.

Binary logistic regression
The Wilcoxon's retains the most discriminating peaks. On the other hand, the logistic regression combines several biomarkers to fi nd the best model allowing classifi cation in cancer/control groups. Let us consider the diagnosis variable Y to be modelled, which takes two values: • Y = 1 for all the individuals that belong to the cancer group. • Y = 0 for all the individuals that belong to the control group.
The outputs to be modelled Y i |x i follows a Bernouilli distribution of parameter π i = P(Y i = 1|x i ), where x i is a vector line of the actual values for the explanatory variables. The logit of the multiple logistic regression (Hosmer, 2000) is given by where (x 1 ,…, x p ) is a collection of p biomarkers selected in the model. The classical model-building strategy is to fi nd the most parsimonious model that explains the data. This provides a general and numerically stable model. To study the robustness of the logistic regression predictor's selection, the two strategies of models-building forward and stepwise were employed. The signifi cance level of the score chi- square for entering an effect into the model was fi xed at 0.05 in the forward and stepwise logistic regressions. A signifi cance level of 0.05 is considered in the Wald chi-square test to test if an effect must stay in the stepwise logistic regression which was implemented with SAS software (version 8.1). A weight of 1 and 147/23 was affected to the cancer and control groups, respectively. This weighting was employed because the control sample is subsampled. The estimated logit in the forward selection is given by the following expression: where P j Fk designed the jth peak in the original biomarkers matrix which contains 228 peaks and Fk indicates that this peak was detected in the kth fraction. The AUC for the two models were 0.988 for the forward strategy and 0.995 for the stepwise strategy. The forward and stepwise logistic regression modelled the logit with 6 and 8 biomarkers ranging from 3 to 48 kDa. The most stable peaks were the 5 common peaks of the two models i.e. P P P P P for an AUC of 0.973. These three models produced above were used in the comparison of the supervised classifi cation techniques. For the moment, the AUC is the only criterion which makes it possible to evaluate the model in term of classifi cation.

Overview of the Supervised Classifi cations Used
Supervised classifi cation techniques consist in a defi nition of a classifi cation rule based on a training set for which the true class-label is known. The matrix x denotes the biomarkers matrix with n = 169 lines (i.e. individuals) and p columns, where p is the number of biomarkers retained in the different logistic regressions. Also, x 1 ,…, x p denote the p biomarkers contained in the columns of x. The training data consist of N pairs {(x (1) , y (1) ), (x (2) , y (2) ), …, (x (N) , y (N) )} with x (i) ∈ℜ p and the N class-labels corresponding y (i) ∈{−1,1}. In the training set, the true class-label y (i) adopted in the next sections are the following: where, the ith line of the matrix x (i) represents the p coordinates of the ith training sample point.

Support vector machine
A Support Vector Machine (SVM) is a supervised learning technique that constructs an optimal separating hyperplane from the training set with an aim of classifying the test set (Vapnik, 1998;Hastie, 2001;Lee, 2004;Li, 2004). When the data are not linearly separable, one solution for the classifi cation problem is to map the data into the feature space that is usually a higher-dimensional space using a function φ usually non linear. Thus, for x (i) the ith vector in the original input space φ(x (i) ) is the corresponding vector in the feature space. The value of the kernel function K on (x (i) , x (j) ) computes the inner product of φ(x (i) ) and φ(x (j) ) in the feature space. The radial basis kernel is employed in this article. Its formulation is the following: where γ > 0 is a constant and ξ i are the slack variables. This optimization problem is solved by maximizing the Lagrangian dual objective function. The solution β for β has the form of linear combination of the terms y x 1 for each i. The decision function can be written as In other words, if Ĝ(x (j) ) = −1 then j th observation x (j) of the sample test has a class-label y (j) equal to -1 and will belong to the cancer group, else the class-label is equal to 1 and the individual x (j) will belong to the control group.

Linear discriminant analysis and quadratic discriminant analysis
Suppose that f k (x) is the density of the observations x in the k th class, and π k denote the prior probability of class k, where π k k = ∑ = 1 2 1. The Bayes theorem gives us The classifi cation rule for the test set is to affect the observation x' at the k th class with maximal probability P Y k X x ( ) = = ' . For linear and quadratic discriminant analysis, the densities f k are modelled as p-multivariate Gaussian (Webb, 2002;Lee, 2004). To compare the two classes k and l, the log-ratio was defi ned as

Σ Σ
• Linear discriminant analysis (LDA) arises when the covariance matrix Σ k and Σ l are assume equals Σ Σ k k = ∀ , . Then, the decision boundary between classes k and l is an equation linear in x in a p-dimensions hyperplane.
In practice, the true parameters of the Gaussian distributions are not known, but we can estimate them using the training set. Also, an estimate δ k of δ k can be obtained and the decision rule can be written as ˆ( ) arg maxˆ( ). G x x k k = δ

Single-layer neural network
Artifi cial neural networks (ANN) are learning algorithms that are modelled on the neural activity of the brain (Hastie, 2001;Dreyfus, 2002;Chen, 2004;Lee, 2004). Each node represents a neuron, and the connections represent the synapses (Fig. 2). A constant entry x 0 = 1 N is included in the whole perceptron entries, affected of a weight w 0 . The constant w 0 is often referred as the bias and -w 0 is called the threshold. Also, x = (x 0 , x 1 , x 2 , …, x p ) denote the input variables such as x j ∈ℜ N ; w = (w 0 , w 1 , w 2 , …, w p ) denote the associated weight vector. The training set is used to fi nd the appropriate values of the synaptic weights vector (w 0 , w 1 , w 2 , …, w p ) in neural networks to solve the classifi cation problem. If the two classes are linearly separable, it exists a decision boundary {x: w T x = 0}. If w T x > 0, is in the fi rst class and if w T x > 0, x is in the second class. A decision rule Ĝ(x) can be defi ned in terms of a linear function of the input x as follows where sign(z) denotes the sign of the quantity z. Let the risk function R(w) measures the success of a decision rule by comparing the true labels y (i) with the predicted labels Ĝ(x (i) ). The weight vector w is chosen to minimize the risk function. A current choice for the risk function is the sum of squared errors. The gradient descent procedure can be used to fi nd optimum weights ŵ in term of risk, and the decision rule can be written as

Classifi cation trees
A classification tree is a multi-stage decision process that divides successively the whole of the N training sample observations in two homogeneous segments with regard to the class-labels by using the p biomarkers x 1 , …, x p (Hastie, 2001;Yang, 2005). The algorithm needs to select automatically a splitting rule for each internal node. This means determining a splitting variable x j j p l l ∈{ } 1 2 , , , … with an associated threshold S l that has been used to partition the data set at each node in two regions : R L (j l , s l ) = {x|x jl ≤ s l } and R R (j l , s l ) = {x|x jl > s l }. For each splitting variable x jl , the threshold s l is determined by scanning through all of the inputs =1 2 … , and the determination of the best pair (j l , s l ) in term of maximization of the decrease in the node impurity function. In this article, the decrease in the node impurity function is expressed according to the Gini criterion. The splitting process is repeated on each of the two resulting regions of the previous step, and this until the stopping rule stops the process. The splitting process (Nakache, 2003) is stopped when the segment is pure (it contains subjects of the same class), if it contains identical observations, or if it contains a small number of subjects. Then this large tree is pruned using cost-complexity pruning. The fi nal tree retained is noted by T â (Fig. 3). For the K class (here K = 2) and the M nodes in the fi nal tree T â , the proportion of class k observations in terminal node m was computed as In other words, in the particular case where K = 2 for each fi nal node m of the fi nal tree the assignment rule can be also written in the following term: If ˆ. p mk Ն 0 5 then the individuals of the node m are assigned to the class k, else they are assigned to the remaining class.

Boosting trees
The purpose of boosting is to apply M times the weak classification algorithm on the weighted training data, so as to produce a sequence of weak classifiers G m (x), m = 1,2, …, M (Hastie, 2001;Fushiki, 2006). Then, a strong classifier is built by making a linear combination of the weighted sequence of weak classifiers. For a vector variables X, a classifier G m (X) produces a prediction of the class-label Y that belongs to {-1,1}. The error rate on the training sample is where w m i ( ) is the weight associated to the i th observation of the training sample at the mth step. A weak classifier is one whose error rate is only slightly better than random guessing.
The decision rule for ( j 2 , S 2 ) where α m = log((1-ε m )/ε m ). In other words, at the step m the observations misclassified at the previous step have their weights increased (Nakache, 2003), and on the contrary the weights of the well classifi ed observations are decreased. The predictions from all of them are then combined trough a weighted majority vote to produce the fi nal prediction:

Cross-validation
The cross-validation is applied on biomarker selections combined with different classifi cation methods. The logistic regression, which took part at the preselection step, was also used as a discriminant analysis method in the cross-validation. The aim of this step is to validate our method of marker selections, while comparing the predictive power of the different supervised classifi cation methods with this selection method. We applied the holdout method for the cross-validation. That consists on repeating the algorithm of decision rules construction described below, and to estimate their performances. First, the cross-validation consists on a random drawing of a training sample. The training sample size N was varied from 40%, 60%, and 80% of the total sample size n = 169. The remaining sample is named test sample. The features number of the training sample is limited to these p most discriminating features described above, such as x ∈ℜ N×p . The decision rule G(x) is evaluated using the training set whose classlabels are known, and that for the different supervised classifi cation techniques studied in this article. The class-label of each test sample observation is predicted using the decision rules G(x). The classlabels of the test sample being known, the predictions of the different methods can be evaluated by the calculation of TP, TN, FP, FN, where TP, TN, FP, FN means the number of true positive, true negative, false positive and false negative samples, respectively. These numbers are computed in each test set of the 1000 iterations of the cross-validation and summed. For each classifi cation method, the sensitivity, the specifi city and the accuracy was calculated to compare them. The sensitivity is defi ned as TP/(TP+FN) which represents the ability of a classifi cation method to classify correctly the patients reached of cancer, and the specificity defi ned as TN/(TN+FP) the percentage of observation of the control sample correctly classifi ed. The accuracy is defi ned as (TP+TN)/(TP+TN+FP+FN) and measures the percentage of whole of observations correctly classifi ed. The cross-validation was applied to the three biomarkers selections using, under the R software, the package CaMassClass (www.r-project.org) dedicated to the treatment of Protein Mass Spectra (SELDI) Data.

Results and Discussion
The goal of our article was to detect biomarkers and to assess their discriminating capacity using the different several supervised classification methods. In this way, we believe that cross-validation can answer to this question. We developed a threestep strategy to extract markers or combination of markers and to evaluate the robustness of these classifi ers. First, protein peaks from 228 protein clusters were selected by a Wilcoxon test. Then, logistic regression models were used to construct two discriminating subsets of features composed of 6 and 8 protein peaks, ranging from 3 to 48 kDa, using forward (Table 1) and stepwise (Table 2) logistic regressions respectively. A third subset of discriminating markers (Table 3) was built by taking the intersection of the two fi rst. Since there was no gold standard method for classifi cation of mass spectrometry data, we were interested in comparing the performance and the robustness of different classifi cation approaches. The unsupervised and supervised classifi cation methods have been evaluated, but only the latter that showed the most satisfying results was presented in this paper. The mean performance (accuracy, sensibility and specifi city) of our classifi ers on 1000 randomly generated 80:20, 60:40 and 40:60 set of samples were evaluated using different class-prediction models. The results showed that the forward logistic regression is better than the stepwise logistic regression in terms of accuracy and specifi city. Interestingly, 5 protein peaks were common to the two models. A classifi er with the protein peaks common to these two model selection methods allowed a more parsimonious model, as effective as the forward logistic regression (Table 3). Then it can be pointed out that the specifi city was lower than the sensitivity and did not exceed the 0.86. The least effective supervised classifi cation methods was the classifi cation trees and the boosting trees that failed to correctly classify individuals of the control group. Although the sensitivity of both methods was acceptable, it was not the case for the specifi city that was found lower than 0.36. Comparing to Quadratic Discriminant Analysis, the Linear Discriminant Analysis gave the best performance result to discriminate both samples achieving a mean classification accuracy of 0.93, a sensitivity of 0.95, and a specifi city of 0.83 with a 80:20 cross-validation set samples (Table 1). The Linear Discriminant Analysis was slightly better than the Logistic Regression in terms of accuracy, and sensitivity. The results from SVM and Neural Networks were similar in terms of mean performance but showed a lower mean specificity (0.78) compared to Discriminant Analysis and Logistic Regression methods (0.86). Finally, the model selection robustness was confi rmed by using different training sample sizes that varied from 40 to 80%. Interestingly, all the selection methods were stable with all the training sample size tested, except for Quadratic Discriminant Analysis. The Linear Discriminant Analysis remained the most robust method with a mean specifi city ranging from 0.82 to 0.83, and sensitivity from 0.93 to 0.95 with the different sample sizes tested (Table 1).
We showed that Linear Discriminant Analysis, Quadratic Discriminant Analysis, Logistic Regression, Support Vector Machine and Neural Networks were the fi ve most robust supervised classifi cation methods in our study. The combination of the two-sided Wilcoxon's test and the Logistic Regression for the markers pre-selection and the Linear Discriminant Analysis seemed to be the more effective in term of classifi cation of samples in control and cancer groups. We observed that once the most discriminating markers are selected, the results of sensitivity, specifi city and accuracy can be radically different from one method to another. The choice of these classifi cation methods depends on the data, on the choice made for the pre-selection and on the problem that has to be solved. Also, it is essential to test several classifi cation methods on the selected biomarkers. The question of the bias between the selection method and the Discriminant Analysis can arise. Accordingly we evaluated the whole method (i.e. the preselection stage combined with the Discriminant Analysis) in a 5-fold cross-validation (Ambroise, 2002). If we consider the preselection method with the logistic regression forward, we found an accuracy of 0.8763, a sensitivity of 0.9027, and a specifi city of 0.7. The combination of this selection method and this classifi cation method is robust. But these performances could be better if the difference between sizes of the two groups had not been so important. The relatively low specifi city obtained with our data could be explained by this strong imbalance in the size of both sample groups, or by the choice of a control group with high-risk of developping cancer. This last condition could explain the very low specifi city observed with the Classifi cation Trees and the Boosting Trees classifi cation methods, which uses thresholds. In conclusion, this biomarkers selection method should be employed on other studies, to validate its robustness. It also would be interesting to ensure a medium term follow-up of this control group population to allow the reappraisal of benign condition and rule out the possibility of infra clinical and radiological cancer development in this group of patient. In this case, it could allow a correct reallocation of the patient in the correct group and a more effi cient re-evaluation of the different classifi cation methods. Finally, the potential markers selected should be clearly identifi ed and annotated using extra purifi cation such as standard chromatography and/or electrophoresis and analysis by peptide mass fi ngerprint using more resolutive MS techniques or peptide sequencing via tandem MS analysis. This identification presents several interesting features, particularly during the discovery phase, by adding a supplementary validation phase using independent immunological methods, such as ELISA, and by increasing the predictive value of the molecular signature.